Over the past decade, bicycle-sharing systems have been growing in number and popularity in cities across the world. Bicycle-sharing systems allow users to rent bicycles for short trips, typically 30 minutes or less. Thanks to the rise of information technologies, it is easy for a user to access a dock within the system to unlock or return a bicycle. These technologies also provide a wealth of data that can be used to explore how bike-sharing systems are used. This project performs an exploratory analysis of trip data from Ford GoBike, a bike-share system provider.
The data comes in multiple files that must be joined together to cover a full year. This project focuses on the records of individual trips taken from June 2018 to May 2019.
The features included in this dataset: Trip Duration (seconds), Start Time and Date, End Time and Date, Start Station ID, Start Station Name, Start Station Latitude, Start Station Longitude, End Station ID, End Station Name, End Station Latitude, End Station Longitude, Bike ID, User Type ("Subscriber" = Member, "Customer" = Casual), Member Year of Birth, Member Gender.
# import all packages and set plots to be embedded inline
import os
import time
import glob
import numpy as np
import pandas as pd
import helpers as hp
import plotly.express as px
import plotly.graph_objects as go
Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.
# Read all CSV files in the folder and concatenate them into one dataframe
path = r"D:\GitLabRepos\GoBike\gobike"
all_files = glob.glob(os.path.join(path, "*.csv"))
print("Concatenating files into one dataframe...")
start_time = time.time()
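df = pd.concat(
    (
        pd.read_csv(
            file,
            # member_birth_year is a plain year, not a timestamp, so it is not parsed as a date
            parse_dates=['start_time', 'end_time'],
            dtype={"start_station_id": "O", "end_station_id": "O", "bike_id": "O"},
            nrows=10000,  # only the first 10,000 rows of each file are read
        )
        for file in all_files
    ),
    ignore_index=True,
)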
end_time = time.time()
print("done!")
print("It took {} seconds to read and concatenate the datasets".format(round(end_time - start_time, 2)))
df.head()
# exploring NaNs
hp.explore_nans(df, "Exploring NaNs", chart_image=False)
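The helpers module is not shown in this notebook, so the exact behavior of hp.explore_nans is unknown. As a rough stand-in, a per-column summary of missing values can be built with plain pandas. The function name nan_summary and the toy data below are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd


def nan_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing-value counts and percentages, largest first."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Tiny made-up frame, used only to illustrate the shape of the output
toy = pd.DataFrame({
    "duration_sec": [600, 720, 300, 450],
    "member_gender": ["Male", None, "Female", None],
    "member_birth_year": [1988.0, np.nan, 1992.0, 1990.0],
})

summary = nan_summary(toy)
print(summary)
```

Calling nan_summary(df) on the real dataframe would show which columns lose rows when df.dropna() is applied.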
# data cleaning: drop the station coordinate columns (not needed for this analysis), then drop rows with NaNs
df.drop(['start_station_latitude','start_station_longitude', 'end_station_latitude', 'end_station_longitude'], axis=1, inplace=True)
df.dropna(inplace=True)
df.shape
After concatenating the 12 monthly data files (reading the first 10,000 rows of each) and cleaning, the dataframe has 12 columns and 113327 rows.
- What is the average trip duration?
- Does season affect trip duration?
- Do season and month together affect trip duration?
'duration_sec', 'start_time', 'end_time', 'user_type', 'member_gender'
In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.
Question 1. What is the average trip duration?
# Plot the distribution of trip duration (in seconds) as a histogram.
data = go.Histogram(x=df["duration_sec"])
layout = go.Layout(
    title="Distribution of trip duration",
    xaxis={"showgrid": False, "title": "Duration in seconds"},
    yaxis={"showgrid": False, "title": "Frequency"}
)
fig = go.Figure(data, layout)
fig.show()
# Box plot of the same variable to make the outliers visible.
data = go.Box(y=df["duration_sec"], name="Trip Durations in seconds")
layout = go.Layout(
    title="Distribution of trip duration", xaxis={"showgrid": False}, yaxis={"showgrid": False}
)
fig = go.Figure(data, layout)
fig.show()
- Both charts above show a right-skewed distribution with a long right tail: most trips are short, while a small number of very long trips stretch the range.
- These charts show the overall shape of the data, but they are hard to interpret and call for a more precise look. A log transformation helps here and is applied in the next cell.
- Measuring trips in seconds is not practical. I will convert duration_sec to duration_min before applying any log transformation.
# adding new columns to answer the question more precisely
df["duration_min"] = df["duration_sec"] / 60
df['duration_min_log'] = np.log10(df['duration_min'])
# Plot the distribution of trip duration after the log10 transformation.
data = go.Histogram(x=df["duration_min_log"])
layout = go.Layout(
    title="Distribution of trip duration after log transformation",
    xaxis={"showgrid": False, "title": "log10(duration in min)"},
    yaxis={"showgrid": False, "title": "Frequency"}
)
fig = go.Figure(data, layout)
fig.show()
# Box plot of the log-transformed durations.
data = go.Box(y=df["duration_min_log"], name="Trip Duration")
layout = go.Layout(
    title="Distribution of trip duration after log transformation",
    xaxis={"showgrid": False},
    yaxis={"showgrid": False, "title": "log10(duration in min)"}
)
fig = go.Figure(data, layout)
fig.show()
As seen above, the plot of trip duration in seconds is difficult to read, so I applied a base-10 log transformation, which gives a roughly normal shape and makes it possible to answer the question more precisely. Most trips last about 10 minutes on average, so they are short trips.
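Values on the log-transformed axis are exponents of 10, so they have to be converted back to read them as minutes. The short sketch below shows that conversion for a few illustrative tick values (the ticks are chosen for the example, not read from the plot). A value of 1 on the log10 axis corresponds to 10 minutes.

```python
import numpy as np

# Illustrative positions on the log10(duration in min) axis
log_ticks = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0])

# Invert the log10 transform to get back to minutes
minutes = 10 ** log_ticks

for tick, value in zip(log_ticks, minutes):
    print(f"log10 = {tick:.1f}  ->  {value:.1f} min")
```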
Yes, I created a new variable called duration_min_log that holds the trip duration in minutes after a log transformation.
In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).
Question 2. Does season affect trip duration?
Answering this question requires creating new columns from the start_time column: start_month, start_day, start_hour, and a season column for the heatmap plots.
# create the columns
df['start_month'], df['start_day'], df['start_hour'] = (
    df['start_time'].dt.month_name(),
    df['start_time'].dt.day_name(),
    df['start_time'].dt.hour,
)
df["season"] = df.apply(hp.seasons, axis=1)
df.head()
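The season column above comes from hp.seasons, whose definition is not included in this notebook. For illustration only, here is one way such a mapping could be written without a row-wise apply. The month-to-season table assumes meteorological seasons in the Northern Hemisphere, which may differ from what hp.seasons does, and the names MONTH_TO_SEASON and add_season are hypothetical.

```python
import pandas as pd

# Assumption: meteorological seasons, Northern Hemisphere.
# The author's hp.seasons helper may use different boundaries.
MONTH_TO_SEASON = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}


def add_season(frame: pd.DataFrame, time_col: str = "start_time") -> pd.DataFrame:
    """Return a copy of frame with a 'season' column derived from time_col."""
    out = frame.copy()
    out["season"] = out[time_col].dt.month.map(MONTH_TO_SEASON)
    return out


# Made-up timestamps, one per season
toy = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2018-06-15 08:30", "2018-09-01 17:05",
        "2019-01-20 12:00", "2019-04-03 09:45",
    ])
})
toy = add_season(toy)
print(toy)
```

Mapping the month number through a dictionary works on the whole column at once, which tends to be faster than calling a Python function for every row.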
# season vs. median trip duration
season_duration_median = df.groupby('season')['duration_min'].median().reset_index()
fig = go.Figure(
    go.Bar(
        x=season_duration_median['season'].tolist(),
        y=season_duration_median['duration_min'].tolist(),
        text=season_duration_median['duration_min'].round(2).astype(str).tolist(),
        textposition="auto"
    ),
    go.Layout(
        title="Median trip duration per season in minutes",
        xaxis={"showgrid": False, "title": "Season"},
        yaxis={"showgrid": False, "title": "Median trip duration (min)"}
    )
)
fig.show()
Because this data contains many outliers, I measured the typical duration with the median rather than the mean so that the results are not distorted. Although trip duration does not differ significantly across seasons, the plot shows that spring has the longest median trip duration. I expected this, since the pleasant spring weather encourages cycling.
In fact, weather does not seem to affect trip duration much in SF. I am not sure why, but it may be because the weather there rarely swings to extremes.
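The choice of median over mean can be illustrated with a handful of made-up durations. In this sketch a single day-long trip pulls the mean far above what a typical rider does, while the median barely moves. The numbers are invented for the example and do not come from the dataset.

```python
import numpy as np

# Made-up trip durations in minutes; the last value mimics an outlier
durations_min = np.array([6.0, 8.0, 9.0, 10.0, 12.0, 1400.0])

mean_min = durations_min.mean()
median_min = np.median(durations_min)

print(f"mean:   {mean_min:.2f} min")
print(f"median: {median_min:.2f} min")
```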
Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.
Question 3. Do season and month or day of week together affect trip duration?
# Heatmap of log10 trip duration by month (x) and season (y)
fig = go.Figure(
    go.Heatmap(
        z=df['duration_min_log'].tolist(),
        x=df['start_month'].tolist(),
        y=df['season'].tolist(),
        # hoverongaps = False
    ),
    go.Layout(
        title="Relationship between trip duration and months across year",
        xaxis={"showgrid": False, "title": "Months"},
        yaxis={"showgrid": False, "title": "Season"},
        xaxis_type="category"
    )
)
fig.show()
# Heatmap of log10 trip duration by day of week (x) and season (y)
fig = go.Figure(
    go.Heatmap(
        z=df['duration_min_log'].tolist(),
        x=df['start_day'].tolist(),
        y=df['season'].tolist(),
        # hoverongaps = False
    ),
    go.Layout(
        title="Relationship between trip duration and days across week",
        xaxis={"showgrid": False, "title": "Day of week"},
        yaxis={"showgrid": False, "title": "Season"},
        xaxis_type="category"
    )
)
fig.show()
I created the season column to explore trip duration against season and month together. In the first heatmap, the longest trip durations appear in summer, specifically in August. Winter comes second, in Sep and Jan, while the spring months show the shortest trip durations of the year. Unlike the earlier bar chart, where spring has the longest median trip duration, the heatmap points to summer as the season with the longest trips. The second heatmap supports the same reading: trip durations peak in summer, most often on Wednesdays.
An interesting point is that users take long trips in the summer even though it is hot.
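The heatmaps above receive one row per trip, so many trips fall into the same month and season cell. One option that may make the comparison with the bar chart more direct is to aggregate first, for example by taking the median duration per cell, and plot the resulting table. The sketch below does the aggregation with pandas on made-up data. The column names match the notebook, but the values and the month_order list are invented for the example.

```python
import pandas as pd

# Made-up trips; the real dataframe would have one row per recorded trip
toy = pd.DataFrame({
    "season": ["Summer", "Summer", "Summer", "Winter", "Winter", "Spring"],
    "start_month": ["June", "June", "August", "January", "January", "April"],
    "duration_min": [8.0, 12.0, 15.0, 7.0, 9.0, 11.0],
})

# Calendar order for the columns (only the months present in the toy data)
month_order = ["January", "April", "June", "August"]

# One median value per (season, month) cell; empty cells stay NaN
pivot = toy.pivot_table(
    index="season",
    columns="start_month",
    values="duration_min",
    aggfunc="median",
).reindex(columns=month_order)

print(pivot)
```

The resulting table could then be passed to go.Heatmap with z=pivot.values, x=pivot.columns and y=pivot.index, so that each cell shows a single summary value.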